Constrained decoding with weighted automata

I want to share a take on grammar-constrained generation I've been working on for a while. This post is a preliminary writeup: enough detail to communicate the core read-the-GLR-stack-with-a-weighted-automaton idea (and timestamp it). I’ll likely follow up with a post on the various tricks you need to actually build the damn things without exploding.

The problem

You have a grammar (often a JSON Schema) and you want the LLM to produce grammatically valid output.

At each step, the LLM outputs logits over the vocabulary $V$. You sample a token. Append it to the output. Repeat.

Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.

So we maintain a constraint state and, for each step, compute a mask $M \subseteq V$ of allowed tokens, and apply it to the logits.

No more broken JSON.
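
As a minimal sketch of what "apply it to the logits" can mean, assuming the mask is given as a set of allowed token IDs and that at least one token is allowed (the helper names `apply_mask` and `softmax` are made up for illustration):

```python
import math

def apply_mask(logits, allowed):
    """Return a copy of logits with every disallowed token id set to -inf."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    """Plain softmax; exp(-inf) is 0.0, so masked tokens get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]   # toy 4-token vocabulary
allowed = {1, 3}                 # the mask M, as a set of token ids
probs = softmax(apply_mask(logits, allowed))
```

Tokens 0 and 2 end up with probability exactly zero, so sampling from `probs` can never pick them.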

Two primitive operations

Think of constrained decoding as two functions:

get_mask(state) returns a bitmask $M \subseteq V$ over the LLM vocabulary.
commit(state, token) updates the parser state with the chosen token.

At runtime the loop looks like:

loop:
  parallel:
    GPU: logits
    CPU: mask
  apply mask
  sample token
  commit token to model and constraint
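
The loop above could be sketched roughly as follows. Everything here is a toy stand-in, not a real inference stack: `toy_logits` plays the role of the GPU forward pass, the "grammar" is balanced braces with nesting capped at 2, the constraint state is just a depth counter, and sampling is greedy.

```python
from concurrent.futures import ThreadPoolExecutor

VOCAB = ["{", "}", "<eos>"]

def toy_logits(output):
    # Stand-in for the model: always prefers "}", then "{", then "<eos>".
    return [1.0, 2.0, 0.0]

def get_mask(depth):
    # Allowed token ids for the toy balanced-brace grammar.
    allowed = set()
    if depth < 2:
        allowed.add(0)   # "{" may open another level
    if depth > 0:
        allowed.add(1)   # "}" may close an open level
    if depth == 0:
        allowed.add(2)   # "<eos>" only when balanced
    return allowed

def commit(depth, token_id):
    # Advance the constraint state with the chosen token.
    return depth + (1 if token_id == 0 else -1 if token_id == 1 else 0)

def generate(max_steps=8):
    depth, output = 0, []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(max_steps):
            logits_f = pool.submit(toy_logits, output)   # "GPU: logits"
            mask_f = pool.submit(get_mask, depth)        # "CPU: mask"
            logits, allowed = logits_f.result(), mask_f.result()
            token_id = max(allowed, key=lambda i: logits[i])  # greedy "sample"
            if VOCAB[token_id] == "<eos>":
                break
            output.append(VOCAB[token_id])
            depth = commit(depth, token_id)
    return "".join(output)
```

The step only finishes when both futures have resolved, which is why a late mask stalls the whole iteration.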

Why worst-case latency matters

LLM inference is batched: multiple user requests run on the GPU at once. And mask generation sits directly on the critical decoding path.

If you want to generate 1k tokens per second, you have 1 millisecond to produce a mask. And if you miss even a single one - if a mask takes 2ms instead of 1 - it screws up scheduling; the GPU sits waiting on the straggler mask, and suddenly your p99 latency is trash.

Masks can't be late; they have to arrive on time every time. It's worst-case performance that matters.

It’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”

Commit ~ incremental parsing

The commit op is analogous to incremental parsing: you’re incrementally parsing a growing prefix.

Many battle-tested incremental parsing systems like tree-sitter and Lezer use Generalized LR (GLR) parsing with a graph-structured stack (GSS).

An LR parser maintains a stack of parser state IDs.
It decides what to do by looking at the top of stack plus the next terminal symbol.
The “Generalized” part handles ambiguity: instead of a single stack, you keep a graph of possible stacks sharing structure (the GSS). When ambiguity arises, stacks fork; when they converge to the same state, they merge; error paths are pruned.
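
To make the structure sharing concrete, here is a tiny sketch of a GSS, assuming parser state IDs live on nodes and each node points at the nodes below it. The `Node` class and `stacks` helper are hypothetical, not taken from tree-sitter or Lezer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    state: int            # LR parser state ID
    parents: tuple = ()   # nodes directly below this one on the stack

def stacks(node):
    """Enumerate every stack (top to bottom) reachable from a GSS head."""
    if not node.parents:
        return [[node.state]]
    return [[node.state] + rest for p in node.parents for rest in stacks(p)]

bottom = Node(0)
left = Node(3, (bottom,))        # one parse of the input so far
right = Node(5, (bottom,))       # an alternative parse (a fork)
head = Node(7, (left, right))    # both forks reached state 7 and merged
```

Here `stacks(head)` yields two stacks that share both their top node and their bottom node, which is the compaction the GSS buys you.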

LLM Token: An element of the model vocabulary—a BPE token with an ID and byte string. ~200k of these in modern tokenizers.
Grammar Terminal: The symbols your parser actually consumes. Might be bytes, characters, or lexer tokens like --.
Parser State ID: An integer identifying the LR automaton's current state. The parser stack is a sequence of these.

Generating masks

Now you have a parse state (often a GSS), and you want a mask over LLM tokens.

A straightforward approach is:

For each LLM token $t$, check whether appending it keeps the parse valid.

But you don’t append an LLM token to the parser. You append terminal symbols.

And a single LLM token is a byte string which might:

expand into multiple terminals ("}\n" could lex as RBRACE NEWLINE),
end partway through a terminal (halfway through a string literal, halfway through a number, halfway through true),
or be lexically ambiguous

So “try token $t$” is really “try every way $t$’s bytes could lex into some terminal sequence, and then try advancing the parser for that terminal sequence.”
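
To illustrate how one LLM token can have several lexes, here is a small sketch under strong simplifying assumptions: every terminal is a fixed literal, there is no longest-match rule, and the terminal names (`MINUS`, `DECR`, and so on) are invented for the example.

```python
TERMINALS = {"MINUS": "-", "DECR": "--", "RBRACE": "}", "NEWLINE": "\n"}

def lexes(text):
    """Yield (complete_terminals, partial_terminal_or_None) for each way to lex text."""
    if text == "":
        yield (), None
        return
    for name, lit in TERMINALS.items():
        if text.startswith(lit):
            # A full terminal matches here; lex whatever is left.
            for rest, partial in lexes(text[len(lit):]):
                yield (name,) + rest, partial
        elif lit.startswith(text):
            # The token ends partway through this terminal.
            yield (), name

brace_newline = list(lexes("}\n"))
double_minus = set(lexes("--"))
```

`"}\n"` has a single lex (`RBRACE NEWLINE`), while `"--"` has three: two `MINUS` terminals, one `DECR`, or a `MINUS` followed by the first half of a `DECR`.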

Tries/trellises and why they still hurt in p99.9 land

A common direction is:

Build a trie/trellis representing how LLM tokens map to terminal sequences.
Traverse that trie while advancing the parser state, pruning invalid paths.
Collect the set of LLM tokens that survive.
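
A rough sketch of those three steps, assuming each LLM token has exactly one lex and using a brace-depth counter as a stand-in for the real parser. The names `build_trie`, `allowed_tokens`, and `step` are illustrative only.

```python
def build_trie(token_lexes):
    """Build a trie keyed on terminals; each node lists the tokens ending there."""
    root = {"children": {}, "tokens": []}
    for token_id, terminals in token_lexes.items():
        node = root
        for t in terminals:
            node = node["children"].setdefault(t, {"children": {}, "tokens": []})
        node["tokens"].append(token_id)
    return root

def allowed_tokens(node, state, step):
    """Walk the trie while advancing the parser; invalid branches are pruned."""
    allowed = set(node["tokens"])
    for terminal, child in node["children"].items():
        nxt = step(state, terminal)
        if nxt is not None:
            allowed |= allowed_tokens(child, nxt, step)
    return allowed

def step(depth, terminal):
    # Toy "parser": balanced braces, state is the current nesting depth.
    if terminal == "LBRACE":
        return depth + 1
    if terminal == "RBRACE" and depth > 0:
        return depth - 1
    return None

trie = build_trie({
    0: ["LBRACE"],
    1: ["RBRACE"],
    2: ["RBRACE", "RBRACE"],
    3: ["LBRACE", "RBRACE"],
})
```

At depth 1, `allowed_tokens(trie, 1, step)` keeps tokens 0, 1, and 3 and prunes token 2, whose second `RBRACE` has nothing to close. Note that every surviving branch required a call into the parser.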

This can work well for many grammars and workloads. People implement this and get decent average performance, especially for “simple” grammars.

But for general CFG-ish constraints (especially when you care about worst-case), you run into two issues:

1) You end up doing real parser work “inside” masking

Even if your lexical side is deterministic (say you enforce longest-match, or you have a DFA lexer), a single grammar terminal shift can trigger a chain of reductions.

In an LR parser, “shift terminal $t$” is often really:

while top-of-stack wants to reduce, do some reductions (pop some states, push a goto state),
then shift $t$.

Those reduction chains are table-driven and fast in the happy path, but in GLR they can fan out. The operations are pointer-heavy and cache-unfriendly: you’re manipulating a GSS, merging nodes, pruning branches, etc.
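
The reduce-then-shift behaviour can be seen in a tiny deterministic LR parser. This is a sketch with hand-built SLR(1) tables for the toy grammar E -> E "+" "n" | "n"; the table layout and the `feed` helper are assumptions made for illustration, and a real GLR parser would do this over a GSS rather than a single list.

```python
# ACTION maps (state, terminal) to shift/reduce/accept; "$" is end of input.
ACTION = {
    (0, "n"): ("shift", 1),
    (1, "+"): ("reduce", "E", 1), (1, "$"): ("reduce", "E", 1),  # E -> n
    (2, "+"): ("shift", 3), (2, "$"): ("accept",),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", "E", 3), (4, "$"): ("reduce", "E", 3),  # E -> E + n
}
GOTO = {(0, "E"): 2}

def feed(stack, terminal):
    """Advance an LR stack by one terminal.

    Returns (new_stack, number_of_reductions), or None on a parse error.
    """
    stack = list(stack)
    reductions = 0
    while True:
        act = ACTION.get((stack[-1], terminal))
        if act is None:
            return None
        if act[0] == "shift":
            stack.append(act[1])
            return stack, reductions
        if act[0] == "accept":
            return stack, reductions
        _, lhs, rhs_len = act
        del stack[-rhs_len:]                       # pop the rule's right-hand side
        stack.append(GOTO[(stack[-1], lhs)])       # push the goto state
        reductions += 1
```

For example, `feed([0, 2, 3, 4], "+")` first reduces E -> E + n (popping three states) and only then shifts, returning `([0, 2, 3], 1)`.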

If you do that inside masking—i.e. while traversing a trie of possible tokens—you’re effectively building and mutating a GSS for a hypothetical next token, and doing it potentially thousands of times per step.

That’s exactly the kind of work you don’t want on the critical path.

2) You still pay for long/ugly tokens

Even with tries, there are some tokens whose byte strings correspond to exceedingly long sequences of terminals. Sometimes these get quickly pruned, e.g. ????. But not always.

Lexing (grammar tokenization): Splitting source text into grammar terminals like MINUS, NUMBER, IDENT. This is what your parser consumes.
Tokenization (BPE/LLM tokenization): Splitting text into LLM tokens—subword units from the model's vocabulary (~200k entries). This is what the model produces and what we're masking.

For example, Python will happily accept monstrosities like:

>>> ----------------1
1

The sixteen hyphens get tokenised (BPE) as a single LLM token:

---------------- (ID: 7535)
1 (ID: 16)

But token 7535 lexes to 16 MINUS terminals:

MINUS MINUS MINUS MINUS … (×16 total)

Every time a mask is generated, the parser must ingest all 16 MINUS terminals of token 7535. This forces it to do a lot of terminal-level processing—16 shifts through the grammar, each of which may involve many reductions—triggered by a single LLM token.
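
The lexing half of this is easy to check with Python's own standard-library tokenizer. The snippet below only verifies the grammar-terminal side (sixteen separate minus operators); it does not check the BPE token ID, which depends on the tokenizer in use.

```python
import io
import tokenize

src = "-" * 16 + "1"

# Lex the source with Python's tokenizer and keep only the "-" operators.
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
minus_terminals = [t for t in toks if t.type == tokenize.OP and t.string == "-"]
numbers = [t.string for t in toks if t.type == tokenize.NUMBER]

# Sixteen negations cancel out, so the expression evaluates to 1.
value = eval(src)
```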

You can transform the grammar to remove unit/null reductions. You can aggressively merge and simplify GSS nodes. You can add clever memoization, early exits. But in my experience, these pesky LLM tokens with complicated lexes are just really hard to engineer away.

I tried hard to make the “trie + incremental parse simulation” approach behave well in worst-case latency terms. In my experience, it’s a dead end if you’re aiming for predictable sub-millisecond masking on arbitrary inputs/grammars.

Invert the problem

Token validity depends on what's on the stack. Instead of asking "does LLM token t work on this stack?" for each token, ask "given this stack, which LLM tokens work?"

That's the reframing. Now we need a data structure that turns "stack → allowed tokens" into something we can execute quickly.

That’s where the weighted automaton comes in.

Weighted automata over parser states

Think of a finite automaton, except each transition carries a weight, and traversing paths accumulates weights.

In our weighted automata:

Input symbols are LR parser state IDs (i.e. elements of the parse stack).
Weights are bitsets over the LLM vocabulary.
Traversing a transition applies an intersection (filter tokens).
Merging paths onto one state applies a union (tokens valid via any path).
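
One way to write this down (the notation $w$, $\pi$, and $\mathrm{Acc}$ is introduced here for illustration and is not from the original post): weights live in the semiring $(2^V, \cup, \cap, \emptyset, V)$, where union plays the role of addition and intersection the role of multiplication. For a stack $s$ read top to bottom, the mask is then

$$
M(s) \;=\; \bigcup_{\pi \in \mathrm{Acc}(s)} \;\bigcap_{e \in \pi} w(e)
$$

where $\mathrm{Acc}(s)$ is the set of accepting automaton paths whose labels spell out $s$ (or the prefix of $s$ that was actually read) and $w(e) \subseteq V$ is the token set on transition $e$. Because $\cap$ distributes over $\cup$, this can be evaluated state by state, without enumerating paths.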

The automaton reads (a representation of) your current parse configuration:

For a single LR stack: the sequence of state IDs from top to bottom.
For GLR: a GSS, i.e. a DAG representing many possible stacks sharing suffixes.

A token is valid if it’s valid on any parse path in the GSS, so on a GSS you just union the results across paths.

Another way to see it

Here's the framing that helped me:

Fix an LLM token $t$.
Consider the set of all possible lexes of $t$ (each lex is a terminal sequence).
For each such lex (terminal sequence), consider all stack configurations from which that sequence can be legally parsed without error.

That set of stacks is a regular language over parser state IDs (this is closely related to the classical “viable prefixes are regular” result from LR theory).

So you can imagine an automaton $A_t$ that recognizes “stacks on which token $t$ is valid.”

Of course, if you do that for every $t \in V$, you’d have 200k automata. That’s useless at runtime.

A weighted automaton is how you smash those 200k membership tests into one run:

Transitions are annotated with “which tokens could accept if we take this transition.”
Running it once on the stack gives you exactly: “which tokens’ $A_t$ would accept this stack.”

Same computation, but 'vectorized' across tokens via bitsets (although I find rangesets perform better than bit vectors, so maybe 'vectorized' isn't quite accurate).
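
For readers unfamiliar with rangesets, here is one possible representation, assuming a token set is stored as a sorted list of disjoint half-open `(lo, hi)` ranges of token IDs. This is a generic sketch and not necessarily how the implementation described in this post stores them.

```python
def intersect(a, b):
    """Intersect two sorted lists of disjoint half-open (lo, hi) ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """Union two rangesets, coalescing overlapping or adjacent ranges."""
    out = []
    for lo, hi in sorted(a + b):
        if out and lo <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out
```

Both operations cost time proportional to the number of ranges rather than the vocabulary size, which helps when token sets are made of a few long runs of IDs.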

A nice optimization: stop reading the stack early

In practice you rarely need the full stack. As you scan states, each token’s fate becomes fixed: it either already reaches an accepting state (so it will stay valid) or it has been filtered out forever. Once a token is decided, pushing it deeper won’t change anything.

So maintain a “decided” set as you go. Peel those tokens off the frontier. When everything left is decided, you can stop immediately—no need to read the rest of the stack.

Runtime sketch

For a single LR stack, the runtime looks like:

def get_mask(stack_state_ids_top_to_bottom):
    # frontier: map automaton_state -> bitset(tokens)
    frontier = { A.start: ALL_TOKENS }

    for sid in stack_state_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for (a2, weight) in A.step(a_state, sid):
                tokens2 = tokens & weight  # ∩ (filter)
                if tokens2.any():
                    new[a2] = new.get(a2, EMPTY) | tokens2  # ∪ (merge)
        frontier = new

        if decided(frontier):
            break

    return combine_accepting(frontier)
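
Here is a runnable toy version of that sketch, with loudly stated assumptions: bitsets are plain Python ints, the four-token vocabulary and the automaton `STEP` are hand-written for a balanced-brace toy (parser state 1 means "inside a brace", state 0 is the bottom of the stack), a token is "decided valid" once it reaches the special `ACCEPT` state, and anything still undecided at the bottom of the stack is treated as rejected.

```python
ACCEPT = "accept"
LBRACE, RBRACE, RBRACE2, X = 1 << 0, 1 << 1, 1 << 2, 1 << 3   # "{", "}", "}}", "x"
ALL_TOKENS = LBRACE | RBRACE | RBRACE2 | X

# (automaton_state, parser_state_id) -> [(next_automaton_state, weight_bitset)]
STEP = {
    ("q0", 1): [(ACCEPT, LBRACE | X | RBRACE), ("q1", RBRACE2)],
    ("q0", 0): [(ACCEPT, LBRACE | X)],
    ("q1", 1): [(ACCEPT, RBRACE2)],
}

def get_mask(stack_top_to_bottom):
    """Return (mask_bitset, number_of_stack_entries_read)."""
    frontier = {"q0": ALL_TOKENS}
    accepted = 0
    read = 0
    for sid in stack_top_to_bottom:
        read += 1
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in STEP.get((a_state, sid), []):
                tokens2 = tokens & weight                 # intersection: filter
                if tokens2:
                    new[a2] = new.get(a2, 0) | tokens2    # union: merge
        accepted |= new.pop(ACCEPT, 0)   # peel decided tokens off the frontier
        frontier = new
        if not frontier:                 # every token is decided: stop early
            break
    return accepted, read
```

On the stack `[1, 1, 0]` every token is decided after two entries, so the bottom state is never read.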

This is the key operational point:

You are not iterating over the vocabulary.
You are not simulating big chains of reductions at runtime.
You are doing:
- transition-table lookups,
- bitset intersections,
- bitset unions at merges.

Runtime on a GSS

For a GSS, you do basically the same computation, but over a graph rather than a single list. Conceptually:

each edge in the GSS is labeled by a parser state ID,
each node represents a shared suffix,
you propagate ∩/∪ through the product of:
- GSS structure (many stacks),
- automaton transitions.

A token is valid if it’s valid on any stack path, so whenever GSS paths merge you union their token sets, and whenever automaton paths merge you union there too.

The important part is: it’s still the same “bitset flows through a graph via ∩ and ∪” pattern. No backtracking, no “try token $t$” loop.
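
A minimal sketch of that propagation, under several assumptions: the GSS is a DAG given as a dict from node to `(parser_state_id, next_node_down)` edges, the weighted automaton is a hand-written toy for balanced braces, bitsets are Python ints, and the traversal is a simple round-by-round frontier (a production version would more likely process nodes in topological order to avoid revisiting them).

```python
ACCEPT = "accept"
LBRACE, RBRACE, RBRACE2, X = 1 << 0, 1 << 1, 1 << 2, 1 << 3   # "{", "}", "}}", "x"
ALL_TOKENS = LBRACE | RBRACE | RBRACE2 | X

# Toy weighted automaton: (automaton_state, parser_state_id) -> [(next, weight)]
STEP = {
    ("q0", 1): [(ACCEPT, LBRACE | X | RBRACE), ("q1", RBRACE2)],
    ("q0", 0): [(ACCEPT, LBRACE | X)],
    ("q1", 1): [(ACCEPT, RBRACE2)],
}

# Toy GSS: the stacks 1,1,0 / 1,0 / 0 share suffixes via nodes a -> b -> c.
EDGES = {"a": [(1, "b")], "b": [(1, "c")], "c": [(0, "bottom")]}

def get_mask_gss(heads):
    """Union of valid tokens over every stack reachable from the given heads."""
    frontier = {(h, "q0"): ALL_TOKENS for h in heads}
    accepted = 0
    while frontier:
        new = {}
        for (node, a_state), tokens in frontier.items():
            for sid, nxt in EDGES.get(node, []):
                for a2, weight in STEP.get((a_state, sid), []):
                    tokens2 = tokens & weight            # filter along the edge
                    if not tokens2:
                        continue
                    if a2 == ACCEPT:
                        accepted |= tokens2              # valid on some stack
                    else:
                        key = (nxt, a2)
                        new[key] = new.get(key, 0) | tokens2   # merge paths
        frontier = new
    return accepted
```

With heads `["b", "c"]` the result is the union of what each stack allows on its own: `"}"` is valid only on the stack starting at `b`, but it still appears in the mask.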

The details of the GSS traversal are omitted here.

Why this is the right kind of work for p99.9

Mask computation becomes:

A scan over the relevant portion of the stack/GSS
With ops that are mostly:
- transition-table lookups
- bitset intersections
- unions on merge

Critically:

You’re not iterating over the vocabulary.
You’re not traversing a token trie.
You’re not simulating big chains of reductions.
You're not reading further down the stack than you need to.

Work is proportional to stack structure, not the grammar complexity or vocabulary size.

Putting it together

commit(state, token) uses GLR machinery:

update the GSS for the actual emitted token
keep ambiguity compact

get_mask(state) uses the weighted automaton:

treat the current GSS as the “input”
propagate token bitsets through the automaton with ∩/∪
return a vocabulary mask

So you get a split where:

commit is responsible for advancing the parser state with a chosen LLM token
get_mask is responsible for cheaply answering “what LLM tokens could be next?”

Why this feels close to optimal

I’m not claiming “this is optimal” as a theorem. GLR doesn't really have optimality theorems either, but it feels optimal. Maybe that's just because so many people use it. Maybe it's just because it's fast. Maybe there's some deeper reason. I don't know.

On the commit side, GLR-style techniques are already a strong fit for incremental CFG parsing. Table-driven decisions, compact ambiguity representation via GSS. There's a reason people working on these kinds of problems keep converging on GLR.

On the mask side, the weighted automaton performs the minimum work you'd reasonably hope for: read the stack once, never backtrack, stop when determined. Inner loops are bitset intersections and unions—fast, predictable.

This doesn’t magically resolve the lack of strict bounds on GLR parsing time, or its behaviour in pathological cases. You still inherit those.

Fast run, slow compile

Precompiling the grammar into a weighted automaton involves determinization and minimization in a semiring setting. Large grammars can blow up if you're careless.

Getting compile-time and memory to behave took most of my engineering effort. That's probably the next post.